Hacking Wikipedia for Hyponymy Relation Acquisition

Authors

  • Asuka Sumida
  • Kentaro Torisawa
Abstract

This paper describes a method for extracting a large set of hyponymy relations from Wikipedia. Wikipedia is much more consistently structured than generic HTML documents, so a large number of hyponymy relations can be extracted with simple methods. In this work, we extracted more than 1.4 × 10^6 hyponymy relations with 75.3% precision from the Japanese version of Wikipedia. To the best of our knowledge, this is the largest machine-readable thesaurus for Japanese. The main contribution of this paper is a method for hyponymy acquisition from hierarchical layouts in Wikipedia. By using a machine learning technique and pattern matching, we were able to extract more than 6.3 × 10^5 relations from hierarchical layouts in the Japanese Wikipedia, with a precision of 76.4%. The remaining hyponymy relations were acquired by existing methods for extracting relations from definition sentences and category pages. This means that extraction from the hierarchical layouts almost doubled the number of relations extracted.
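To make the idea of acquiring relations from hierarchical layouts concrete, here is a minimal Python sketch of the candidate-generation step only. It is not the authors' implementation. It assumes MediaWiki-style markup (`== Heading ==` lines and `*`/`#` list items), and the function name `hyponymy_candidates`, the depth encoding, and the sample page are hypothetical choices made for illustration. The machine learning and pattern matching filter mentioned in the abstract is not reproduced.

```python
import re
from typing import List, Tuple

# Assumed MediaWiki-style markup: "== Heading ==" and "* item" / "** item".
HEADING = re.compile(r"^(=+)\s*(.+?)\s*=+\s*$")
BULLET = re.compile(r"^([*#]+)\s*(.+?)\s*$")

# Hypothetical depth encoding: list items always sit below any heading.
BULLET_OFFSET = 10


def hyponymy_candidates(title: str, wikitext: str) -> List[Tuple[str, str]]:
    """Return (hypernym candidate, hyponym candidate) pairs.

    Each heading or list item is paired with its nearest enclosing item;
    the page title acts as the root of the hierarchy.
    """
    stack = [(0, title)]  # (depth, text) of the currently open ancestors
    pairs = []
    for raw in wikitext.splitlines():
        line = raw.strip()
        m = HEADING.match(line)
        if m:
            depth = len(m.group(1))
        else:
            m = BULLET.match(line)
            if not m:
                continue  # plain prose is ignored in this sketch
            depth = BULLET_OFFSET + len(m.group(1))
        text = m.group(2)
        # Close every item that is not an ancestor of the current one.
        while stack[-1][0] >= depth:
            stack.pop()
        pairs.append((stack[-1][1], text))
        stack.append((depth, text))
    return pairs


SAMPLE = """\
== Soft cheese ==
* Brie
* Camembert
== Hard cheese ==
* Cheddar
== History ==
"""

if __name__ == "__main__":
    for hypernym, hyponym in hyponymy_candidates("Cheese", SAMPLE):
        print(f"{hyponym} -> {hypernym}")
```

On the sample page, the sketch yields plausible pairs such as (Soft cheese, Brie) alongside a spurious one, (Cheese, History). Noise of this kind is why raw layout candidates need a filtering step before they can reach the precision reported in the abstract.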

Similar resources

Co-STAR: A Co-training Style Algorithm for Hyponymy Relation Acquisition from Structured and Unstructured Text

This paper proposes a co-training style algorithm called Co-STAR that acquires hyponymy relations simultaneously from structured and unstructured text. In Co-STAR, two independent processes for hyponymy relation acquisition, one handling structured text and the other handling unstructured text, collaborate by repeatedly exchanging the knowledge they have acquired about hyponymy relations. Unlike co...

Boosting Precision and Recall of Hyponymy Relation Acquisition from Hierarchical Layouts in Wikipedia

This paper proposes an extension of Sumida and Torisawa’s method of acquiring hyponymy relations from hierarchical layouts in Wikipedia (Sumida and Torisawa, 2008). We extract hyponymy relation candidates (HRCs) from the hierarchical layouts in Wikipedia by regarding all subordinate items of an item x in the hierarchical layouts as x’s hyponym candidates, while Sumida and Torisawa (2008) extracted...

Bilingual Co-Training for Monolingual Hyponymy-Relation Acquisition

This paper proposes a novel framework called bilingual co-training for large-scale, accurate acquisition of monolingual semantic knowledge. In this framework, we combine the independent processes of monolingual semantic-knowledge acquisition for two languages using bilingual resources to boost performance. We apply this framework to large-scale hyponymy-relation acquisition from Wikipedi...

Hypernym Discovery Based on Distributional Similarity and Hierarchical Structures

This paper presents a new method of developing a large-scale hyponymy relation database by combining Wikipedia and other Web documents. We attach new words to the hyponymy database extracted from Wikipedia by using distributional similarity calculated from documents on the Web. For a given target word, our algorithm first finds k similar words from the Wikipedia database. Then, the hypernyms of...

Extracting Hyponymic Relations from Chinese Free Corpus

Hyponymy acquisition is a basic and crucial problem in knowledge acquisition from text. In this paper we present a method of hyponymic relation acquisition and verification based on Chinese lexico-syntactic patterns. First, we use removable lexicons and sentence patterns that have been obtained semi-automatically to analyze Chinese is-a patterns. Then we use an algorithm th...

Publication date: 2008